-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19080][SQL] simplify data source analysis #16269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #70092 has finished for PR 16269 at commit
|
|
Test build #70878 has finished for PR 16269 at commit
|
|
Test build #70880 has finished for PR 16269 at commit
|
|
Test build #70882 has finished for PR 16269 at commit
|
|
Test build #70884 has finished for PR 16269 at commit
|
|
Test build #71188 has started for PR 16269 at commit |
|
Test build #71834 has finished for PR 16269 at commit
|
|
Test build #71839 has finished for PR 16269 at commit
|
|
Test build #71840 has finished for PR 16269 at commit
|
|
Test build #71861 has finished for PR 16269 at commit
|
|
Test build #71862 has finished for PR 16269 at commit
|
|
Will read it carefully tomorrow. Thanks for the great work! |
|
Test build #71908 has finished for PR 16269 at commit
|
|
Test build #71932 has finished for PR 16269 at commit
|
|
Test build #72306 has finished for PR 16269 at commit
|
| t.isInstanceOf[Range] || | ||
| t == OneRowRelation || | ||
| t.isInstanceOf[LocalRelation] => | ||
| failAnalysis(s"Inserting into an RDD-based table is not allowed.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To other reviewers: This is moved to PreWriteCheck
| plan.foreachUp { | ||
| case o if !o.resolved => failAnalysis(s"unresolved operator ${o.simpleString}") | ||
| case _ => | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This movement is great, since extendedCheckRules could output a better error message.
| s"column(s) but the inserted data has " + | ||
| s"${query.output.size + numStaticPartitions} column(s), including " + | ||
| s"$numStaticPartitions partition column(s) having constant value(s).") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To other reviewers: This is moved to PreprocessTableInsertion.
After the above two moves, def checkAnalysis(plan: LogicalPlan) does not have any DDL/DML error handling. It becomes cleaner
| /** | ||
| * Insert some data into a table. | ||
| * Insert some data into a table. Note that this plan is unresolved and has to be replaced by the | ||
| * concrete implementations during analysis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To other reviewers, InsertIntoTable is a unified unresolved node for representing INSERT. After completing the resolution, it will be replaced to InsertIntoDataSourceCommand, InsertIntoHadoopFsRelationCommand or InsertIntoHiveTable.
| def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { | ||
| case CreateTable(tableDesc, mode, None) if DDLUtils.isHiveTable(tableDesc) => | ||
| val cmd = CreateTableCommand(tableDesc, ifNotExists = mode == SaveMode.Ignore) | ||
| ExecutedCommandExec(cmd) :: Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To other reviewers: this is moved to HiveAnalysis.
| tableDesc, | ||
| mode, | ||
| query) | ||
| ExecutedCommandExec(cmd) :: Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To other reviewers: the above two are moved to DataSourceAnalysis
| query) | ||
| ExecutedCommandExec(cmd) :: Nil | ||
|
|
||
| case c: CreateTempViewUsing => ExecutedCommandExec(c) :: Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To other reviewers: Since CreateTempViewUsing is a RunnableCommand, this line is useless
|
|
||
| case InsertIntoTable(l @ LogicalRelation(t: InsertableRelation, _, _), | ||
| part, query, overwrite, false) if part.isEmpty => | ||
| ExecutedCommandExec(InsertIntoDataSourceCommand(l, query, overwrite)) :: Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To the other reviewers: this is moved to DataSourceAnalysis
|
|
||
| /** | ||
| * Create a table and optionally insert some data into it. Note that this plan is unresolved and | ||
| * has to be replaced by the concrete implementations during analyse. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To the other reviewers, CreateTable is a unified unresolved logical representation of CREATE TABLE (AS SELECT). It will be replaced by CreateHiveTableAsSelectCommand, CreateTableCommand, CreateDataSourceTableCommand or CreateDataSourceTableAsSelectCommand.
| // We don't want `table` in children as sometimes we don't want to transform it. | ||
| override def children: Seq[LogicalPlan] = query :: Nil | ||
| override def output: Seq[Attribute] = Seq.empty | ||
| override lazy val resolved: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan After this change, we are unable to reach the check in checkForStreaming. BTW, it sounds like we do not have any test case to cover these scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now we will resolve CreateTable, InsertIntoTable to concrete commands, so the check can still work.
|
|
||
| /** | ||
| * Create a table and optionally insert some data into it. Note that this plan is unresolved and | ||
| * has to be replaced by the concrete implementations during analyse. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: analyse -> analysis
| * | ||
| * Note that, this rule must be run after `PreprocessTableInsertion`. | ||
| * Note that, this rule must be run after `PreprocessTableCreation` and | ||
| * `PreprocessTableInsertion`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add two more rules OrcConversions and ParquetConversions? HiveAnalysis must be run after them too.
| * | ||
| * Note that, this rule must be run after [[PreprocessTableInsertion]]. | ||
| * Note that, this rule must be run after `PreprocessTableCreation` and | ||
| * `PreprocessTableInsertion`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same updates are needed here.
|
LGTM except a few minor comments. The major concern is about our error checking for structured streaming. It sounds like the test case coverage in that area is weak. |
|
LGTM pending test |
|
Test build #72444 has finished for PR 16269 at commit
|
|
thanks for the review, merging to master! |
## What changes were proposed in this pull request? The current way of resolving `InsertIntoTable` and `CreateTable` is convoluted: sometimes we replace them with concrete implementation commands during analysis, sometimes during planning phase. And the error checking logic is also a mess: we may put it in extended analyzer rules, or extended checking rules, or `CheckAnalysis`. This PR simplifies the data source analysis: 1. `InsertIntoTable` and `CreateTable` are always unresolved and need to be replaced by concrete implementation commands during analysis. 2. The error checking logic is mainly in 2 rules: `PreprocessTableCreation` and `PreprocessTableInsertion`. ## How was this patch tested? existing test. Author: Wenchen Fan <[email protected]> Closes apache#16269 from cloud-fan/ddl.
What changes were proposed in this pull request?
The current way of resolving
InsertIntoTableandCreateTableis convoluted: sometimes we replace them with concrete implementation commands during analysis, sometimes during planning phase.And the error checking logic is also a mess: we may put it in extended analyzer rules, or extended checking rules, or
CheckAnalysis.This PR simplifies the data source analysis:
InsertIntoTableandCreateTableare always unresolved and need to be replaced by concrete implementation commands during analysis.PreprocessTableCreationandPreprocessTableInsertion.How was this patch tested?
existing test.